林祥瑞
The lecture has a 40-minute time limit and we have lengthy slides to consume. We may skip some pages if we have to :3.
In this section, we will give a brief intro to information theory. Then we dive into these concepts that are fundamental to deep generative models.
KL-divergence
Jensen–Shannon divergence
The amount of information can be understood as the surprisal of an event.
That is, rare events give us more information. In terms of binary encoding, we need more bits to encode such an event.
Suppose we have a fair die, which is represented by a rv \(X\) with events \(\{1, 2, 3, 4, 5, 6\}\). The amount of information of event \(X = 3\) is measured by \(I(X = 3) = \log_2 \frac{1}{P(X = 3)} \approx 2.58\).
The amount of information can be intuitively understood as # of perfect queries to locate that event.
Take our \(X = 3\) example. We ask three yes-no queries to find it out.
Is it greater than or equal to 4? No
Is it greater or equal to 2? Yes
Is it 3? Yes
With perfect yes-no queries, we expect to ask \(\log_2 \frac{1}{P(X = 3)}\) questions on average.
In fact, making this idea precise requires the law of large numbers and more theory, which is out of scope for this lecture.
The information content is additive over independent events. Suppose we roll the die twice. The information content of the joint event \(X_1 = 3, X_2 = 2\) is \(I(X_1 = 3, X_2 = 2) = \log_2 \frac{1}{P(X_1 = 3) P(X_2 = 2)} = I(X_1 = 3) + I(X_2 = 2) \approx 5.17\).
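As a quick sanity check, the die example can be computed directly. A minimal sketch; `info_content` is our own helper name:

```python
import math

def info_content(p):
    """Information content in bits of an event with probability p."""
    return math.log2(1.0 / p)

p_face = 1.0 / 6.0                       # fair die

i_single = info_content(p_face)          # I(X = 3), about 2.58 bits
i_joint = info_content(p_face * p_face)  # I(X1 = 3, X2 = 2), two independent rolls

# Additivity: the joint information is the sum of the parts.
assert abs(i_joint - 2 * i_single) < 1e-9
```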
The (Shannon) entropy of rv \(X\), denoted as \(H(X)\), can be understood as the expected amount of information.
Suppose a rv \(X\) and its pdf/pmf \(P(x)\). The entropy is \(H(X) = E_{x \sim P(x)}[\log_2 \frac{1}{P(x)}]\).
The entropy is usually interpreted as the randomness of a random variable.
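The expected-information view translates directly into code. A small sketch with our own function name, comparing a fair die to a heavily biased coin:

```python
import math

def entropy(pmf):
    """Shannon entropy in bits: the expected information content."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

h_die = entropy([1 / 6] * 6)   # fair die: maximal randomness for 6 outcomes
h_coin = entropy([0.9, 0.1])   # biased coin: much less random

assert h_coin < h_die
```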
What if we encode a true distribution \(P\) with an artificial distribution \(Q\)?
Encoding based on \(Q\) can be thought of as an imperfect query for \(P\).
The difference \(E_{x \sim P(x)}[\log \frac{1}{Q(x)}] - E_{x \sim P(x)}[\log \frac{1}{P(x)}]\) is always non-negative due to Gibbs' inequality.
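A tiny numeric check of this inequality, assuming discrete distributions on a shared support (the numbers are illustrative):

```python
import math

def cross_entropy(p, q):
    """Expected bits when samples from p are coded with q's code."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

h_p = cross_entropy(p, p)    # entropy of p: the optimal code length
h_pq = cross_entropy(p, q)   # coding p with the code optimized for q

# Gibbs' inequality: an imperfect code never beats the optimal one.
assert h_pq >= h_p
```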
Suppose you have the knowledge of rv \(Y\), and would like to know the randomness of rv \(X\). It can be modeled as conditional entropy \(H(X|Y)\).
The rationale goes as follows:
Given value \(y\), \(H(X | Y = y) = \sum_x P(X = x | Y = y) \log \frac{1}{P(X = x | Y = y)} \)
\(H(X|Y) = \sum_y P(y) H (X | Y = y)\) is expected value over \(y\).
\(H(X | Y)\) is smaller than or equal to \(H(X)\) because you have prior knowledge of \(Y\).
For example, \(X\) indicates whether teacher is inside R442, and \(Y\) is whether you observe teacher walk into R442.
\(H(X) \gt 0\) because you cannot be sure whether the teacher is inside. \(H(X|Y)\) is strictly less than \(H(X)\) because you know the outcome of \(X\) whenever \(Y = \text{true}\).
\(H(X) = H(X|Y)\) if \(X\) and \(Y\) are independent.
The dependence of rvs \(X, Y\) is modeled as mutual information \(I(X; Y)\).
In fact, \(I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\).
Reference: Wikipedia: Conditional Entropy
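These quantities can be verified on a toy joint distribution. The numbers below are illustrative, not from the lecture:

```python
import math

# Hypothetical joint pmf P(X, Y) over binary X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def p_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def p_y(y):
    return sum(p for (_, yi), p in joint.items() if yi == y)

# H(X): entropy of the marginal of X.
h_x = sum(p_x(x) * math.log2(1 / p_x(x)) for x in (0, 1))

# H(X|Y) = sum_y P(y) sum_x P(x|y) log 1/P(x|y)
h_x_given_y = 0.0
for y in (0, 1):
    for x in (0, 1):
        p_cond = joint[(x, y)] / p_y(y)
        h_x_given_y += p_y(y) * p_cond * math.log2(1 / p_cond)

# I(X; Y) = H(X) - H(X|Y); conditioning never increases entropy.
mutual_info = h_x - h_x_given_y
assert 0 <= h_x_given_y <= h_x
```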
Suppose two probability distributions \(P\) and \(Q\). The Kullback–Leibler divergence of \(P\) with respect to \(Q\) is formulated as \(D_{\text{KL}}(P \| Q) = E_{x \sim P(x)}[\log \frac{P(x)}{Q(x)}]\).
The divergence is non-negative due to Gibbs' inequality.
\(P\) is regarded as true probability, and \(Q\) is regarded as theory.
In coding theory, it measures the expected # of extra bits to encode samples from \(P\) using a code optimized for \(Q\).
In Bayesian inference, \(Q\) is prior and \(P\) is posterior. In detail,
\(Q = p(\theta)\) the probability over our model parameters \(\theta\).
\(P = p(\theta | x)\) given a new observation \(x\).
The divergence is information gain by revising the belief \(Q\) to \(P\).
KL-div is not symmetric! \(D_{\text{KL}}(P \| Q)\) may not be the same as \(D_{\text{KL}}(Q \| P)\).
Jensen–Shannon divergence fixes this with \(D_{\text{JS}}(P \| Q) = \frac{1}{2} D_{\text{KL}}(P \| M) + \frac{1}{2} D_{\text{KL}}(Q \| M)\),
where \(M = \frac{1}{2}(P + Q)\)
Reference: GAN — Wasserstein GAN & WGAN-GP
KL-divergence should never be misnamed as KL-metric or KL-distance. Since it is not symmetric, it does not fit the mathematical definition of a metric.
\(D_{\text{KL}}(P \| Q)\) could be infinite if \(P(x) \gt 0\) and \(Q(x) = 0\) for some \(x\).
Both KL-div and JS-div fail to respect the metric on the support, which can lead to vanishing gradients.
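Asymmetry of KL and symmetry of JS are easy to check numerically on a pair of two-outcome distributions. A sketch with our own helper names:

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits over a shared discrete support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.8, 0.2]
q = [0.5, 0.5]

assert abs(kl(p, q) - kl(q, p)) > 1e-3   # KL is asymmetric
assert abs(js(p, q) - js(q, p)) < 1e-12  # JS is symmetric
```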
We will introduce the Wasserstein metric to resolve these issues.

=== Conclusion
Understand basics of information theory and intuition behind KL-divergence.
Know the Jensen–Shannon divergence.
Vanilla autoencoder and denoising autoencoder
Gaussian mixture model
Evidence lower bound (ELBO)
Intuition: Given data sample \(x\), encode it into latent space code \(m\). Then decode it into reconstruction \(x'\).
Trained by minimizing the difference b/w input and reconstruction \(L(x, x')\), usually an L2 or cross-entropy loss.
We can represent \(x\) by lower dimensional code \(m\).
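For concreteness, the L2 reconstruction loss on raw vectors can be sketched as below. A toy version; real implementations operate on batched tensors:

```python
def l2_loss(x, x_rec):
    """Squared L2 distance between input x and reconstruction x_rec."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

# Perfect reconstruction gives zero loss; any deviation is penalized.
assert l2_loss([1.0, 2.0], [1.0, 2.0]) == 0.0
assert l2_loss([1.0, 2.0], [0.0, 2.0]) == 1.0
```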
The autoencoder may not be robust to slight changes on input \(x\).
One solution is to add some noise \(z\) on input \(x\).
Can we slightly change the code \(m\) to \(m'\), and generate new reasonable sample by decoding \(m'\)?
In fact, it does not work as expected, because both the encoder and decoder are non-linear. We cannot expect the latent space to have such a nice property.
Solution: add some noise \(z\) on latent code \(m\).
In VAE, we add a learned noise on latent code as \(c_x = m_x + (\mu_x + e \cdot \exp(\sigma_x)) = m_x + z_x\).
\(x\): the input sample
\(m_x\): vector of latent space code
\(\mu_x\): learned mean (this term is absent in 李宏毅's version)
\(\sigma_x\): learned log-variance
\(\exp(\sigma_x)\): noise variance; the exponent is necessary to avoid negative values from the neural network
\(e\): the unit Normal noise
\(z_x\): the learned noise
\(q_\theta (z | x)\) is probability function of encoder
\(p_\phi (x | z)\) is probability function of decoder
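The noise construction can be sketched directly from the slide's formula. A minimal illustration with our own function name; real implementations vectorize this over a batch:

```python
import math
import random

def learned_noise(mu, sigma, e):
    """Slide's formulation: z = mu + e * exp(sigma).
    The exp keeps the noise scale positive whatever the network outputs."""
    return [m + ei * math.exp(s) for m, s, ei in zip(mu, sigma, e)]

random.seed(0)
e = [random.gauss(0.0, 1.0) for _ in range(3)]  # unit Normal noise

mu = [1.0, 2.0, 3.0]
# A very negative sigma means a tiny variance: the noise collapses to mu.
z = learned_noise(mu, [-30.0] * 3, e)
assert all(abs(zi - mi) < 1e-9 for zi, mi in zip(z, mu))
```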
We add the noise regularizer term \(D_{\text{KL}} (q_\theta(z | x) \| p(z))\) to the loss, where
\(q_\theta(z|x)\) is the normal distribution \(\mathcal{N}(\mu_x, \exp(\sigma_x))\) given by input \(x\)
\(p(z)\) is the unit normal \(\mathcal{N}(0, 1)\)
In implementation, the resulting loss is \(\text{ReconstructionLoss}(x, x') + D_{\text{KL}} (q_\theta(z | x) \| p(z))\)
However, the loss should be \(E_{z \sim q_\theta (z|x)}[\log \frac{1}{p_\phi (x | z)}] + D_{\text{KL}} (q_\theta(z | x) \| p(z))\). We do not adopt this exactly due to implementation difficulty.
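The KL term is cheap to compute because it has a closed form when both distributions are diagonal Gaussians. A sketch under the slide's convention that \(\sigma_x\) is a log-variance (natural log, so the result is in nats); the function name is ours:

```python
import math

def kl_gaussian_to_unit(mu, log_var):
    """Closed-form D_KL( N(mu, exp(log_var)) || N(0, 1) ),
    summed over the dimensions of the latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# When the encoder already outputs the prior, the penalty vanishes.
assert kl_gaussian_to_unit([0.0, 0.0], [0.0, 0.0]) == 0.0

# Any deviation from the unit normal is penalized.
assert kl_gaussian_to_unit([1.0, 0.0], [0.0, 0.0]) > 0.0
```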
I found two ways to interpret this model.
李宏毅’s lecture: Maximizing log likelihood \(\mathcal{L} = \sum \log P(x)\) over all observed data sample \(x\).
This tutorial article: Approximating posterior \(p(z | x)\) by \(q_\theta (z | x)\).
We adopt 李宏毅’s version here.
The distribution of the evidence term \(P(x)\) is fixed but intractable to compute.
We can approximate it by maximizing the likelihood \(\tilde{P}(x) = \tilde{P}(x_1) \tilde{P}(x_2) \cdots \tilde{P}(x_n)\) over all sampled data \(x_i\). Note that \(P(x)\) cannot be known; \(\tilde{P}\) is our parameterized function.
In practice, we maximize the log likelihood \(\log \tilde{P} (x) = \log \tilde{P} (x_1) + \log \tilde{P} (x_2) + \cdots + \log \tilde{P} (x_n)\)
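A quick numeric illustration of why the log is used: the product of many small probabilities underflows, while the sum of logs stays well-scaled. The probability values are illustrative:

```python
import math

# Hypothetical per-sample model probabilities P~(x_i).
probs = [0.1, 0.2, 0.05]

likelihood = math.prod(probs)                      # product over samples
log_likelihood = sum(math.log(p) for p in probs)   # sum over samples

# Same objective: log of the product equals the sum of the logs.
assert abs(math.log(likelihood) - log_likelihood) < 1e-12
```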
Know the design of VAE.
Understand the theoretical foundation of VAE.
Recap on vanilla GAN
Understanding Wasserstein metric
In the context of VAE, we compute the reconstruction error using a hand-crafted function.
Why not let the model learn to discriminate the differences?
Reference: From GAN to WGAN - Lilian Weng
We randomly draw \(z\) from latent space.
The generator outputs fake samples \(\tilde{x} = G(z)\)
The discriminator learns to distinguish between true sample \(x\) from dataset and fake samples \(\tilde{x}\) from generator.
The discriminator outputs a value from 0 to 1 indicating whether the input is a true sample.
Suppose the distributions
\(p_z\): distribution over noise input \(z\), usually uniform
\(p_g\): the distribution of generator over data \(\tilde{x}\)
\(p_r\): the distribution over real sample \(x\)
GAN can be formulated as a minimax game with game value \(\min_G \max_D L(D, G)\).
The generator minimizes the profit
The discriminator maximizes the profit
The game value is defined as \(L(D, G) = E_{x \sim p_r}[\log D(x)] + E_{z \sim p_z}[\log (1 - D(G(z)))]\).
Theoretically, the value stops at a Nash equilibrium (though in practice it often does not).
We repeat this loop to train the generator and discriminator:
Unfreeze the generator and freeze the discriminator.
Train the weights of the generator by feeding random latents \(z\). Store the generated images in the meantime.
Freeze the generator and unfreeze the discriminator.
Feed real data samples and generated (fake) data samples to train the discriminator.
Someone has compiled the published GANs in this list.
In the training session of GAN, each player (generator and discriminator) updates its cost without regard to the other player in the game.
It’s shown that convergence to a Nash equilibrium is not guaranteed in a two-player non-cooperative game.
The first player minimizes \(f_1(x) = xy\), while the second minimizes \(f_2(y) = -xy\).
The value oscillates if each player minimizes its own objective independently.
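The oscillation is easy to reproduce with simultaneous gradient descent on this toy game. A self-contained sketch, with the step size chosen for illustration:

```python
# Each player follows only its own gradient: player 1 minimizes
# f1(x, y) = x * y over x (gradient y), player 2 minimizes
# f2(x, y) = -x * y over y (gradient -x).
lr = 0.1
x, y = 1.0, 1.0
radii = []
for _ in range(100):
    gx, gy = y, -x                 # each player's own gradient
    x, y = x - lr * gx, y - lr * gy
    radii.append(x * x + y * y)

# Instead of settling at the equilibrium (0, 0), the iterates spiral
# outward: the squared norm grows by a factor (1 + lr^2) every step.
assert radii[-1] > radii[0]
```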
We omit the explanation here since it requires mathematical background.
Wasserstein GAN (WGAN) is a variant of GAN that adopts the Wasserstein metric, also called the Earth Mover's distance.
It shows no sign of mode collapse in experiments.
Wasserstein distance can be understood as the minimum effort to move the piles from distribution \(P\) to distribution \(Q\).
The transport plan can be thought of as a joint probability b/w the two distributions, where \(\Pi(p_r, p_g)\) is the collection of joint distributions of \(p_r\) and \(p_g\).
However, the formula above is computationally intractable. The authors of WGAN proposed a formula based on Kantorovich-Rubinstein duality. (click link for explanation)
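The earth-moving intuition can still be made concrete in one dimension, where the distance reduces to the area between the two CDFs (this special case is tractable; the general high-dimensional problem is not). A toy sketch on a shared discrete support with unit spacing:

```python
def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two discrete distributions on the
    same evenly spaced 1-D support: sum of |CDF_p - CDF_q| over bins."""
    dist, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        dist += abs(cdf_gap)
    return dist

# Two "piles of earth" on support {0, 1, 2, 3}.
p = [1.0, 0.0, 0.0, 0.0]
q = [0.0, 0.0, 0.0, 1.0]
assert wasserstein_1d(p, q) == 3.0   # move one unit of mass 3 steps
```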
\(\| f \|_L \leq K\) means \(f\) is K-Lipschitz continuous: \(\lvert f(x_1) - f(x_2) \rvert \leq K \lvert x_1 - x_2 \rvert\).
The generated samples \(\tilde{x}\) can be replaced with the generator function \(\tilde{x} = g_\theta(z)\), while \(f\) acts as the discriminator.
Based on the revised Wasserstein metric, it maximizes the metric \(W\) while making sure \(f\) is K-Lipschitz continuous.
It maintains Lipschitz continuity by clipping weights! This is shown to have side effects.
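Weight clipping itself is a one-liner. A sketch on a flat list of weights, using the clipping bound \(c = 0.01\) from the WGAN paper; real implementations clip parameter tensors in-place after each optimizer step:

```python
def clip_weights(weights, c=0.01):
    """Clamp every weight into [-c, c] to bound the Lipschitz constant."""
    return [max(-c, min(c, w)) for w in weights]

# Large weights are squashed to the boundary; small ones pass through.
clipped = clip_weights([0.5, -0.3, 0.004])
assert clipped == [0.01, -0.01, 0.004]
```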
Many improvements have been done, especially WGAN-GP.
No time for further explanation in this lecture. Here is the link for your interest.
Comprehend design of GANs and its variants.
Know the pros and cons of WGAN.